PLSC30500, Fall 2024

Part 2. Summarizing distributions (part b)

Andy Eggers

Summarizing joint distributions

Motivation

Suppose we have two RVs \(X\) and \(Y\)

  • number of heads in one coin flip and number of green balls drawn from an urn in 6 tries
  • age and height of randomly selected student
  • whether randomly selected citizen served in military and supports a foreign war

We know the joint PMF/PDF \(f(x, y)\) and joint CDF \(F(x, y)\).

How can we summarize the relationship between \(X\) and \(Y\)?

Covariance

\[\text{Cov}[X, Y] = E\left[ (X - E[X])(Y - E[Y]) \right]\]

Intuitively, “Does \(X\) tend to be above \(E[X]\) when \(Y\) is above \(E[Y]\)? (And by how much?)”

\[ f(x,y) = \begin{cases} 1/3 & x = 0, y = 0 \\ 1/6 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]

What is \(E[X]\)? What is \(E[Y]\)?

Then compute expectation of \((X - E[X])(Y - E[Y])\) (function of two RVs) as above.
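To make this concrete, here is a minimal Python sketch (variable names are my own) that computes these quantities from the example PMF:

```python
# Joint PMF from the example above
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

# Marginal expectations
E_X = sum(x * p for (x, y), p in f.items())  # 1/6 + 1/2 = 2/3
E_Y = sum(y * p for (x, y), p in f.items())  # 1/2

# Covariance: expectation of (X - E[X])(Y - E[Y]) over the joint PMF
cov = sum((x - E_X) * (y - E_Y) * p for (x, y), p in f.items())  # 1/6
```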

Variance and covariance

Compare:

\[\begin{align}\text{Cov}[X, Y] &= E\left[ \color{blue}{(X - E[X])}\color{orange}{(Y - E[Y])} \right] \\ \text{V}[X] &= E\left[ \color{blue}{(X - E[X])}\color{blue}{(X - E[X])} \right]\end{align}\]

  • Variance of \(X\) is covariance between \(X\) and itself.
  • Variance can’t be negative but covariance can
  • A justification for \(^2\) in variance formula

Geometric representation (1)

Plot the points in \(\text{Supp}[X, Y]\) on two axes with point size proportional to \(f(x, y)\).

Divide the \(x, y\) plane into quadrants defined by \(x = E[X]\) and \(y = E[Y]\).

Geometric representation (2)

For each point \((x, y) \in \text{Supp}[X, Y]\), create a rectangle with \((x,y)\) at one corner and \((E[X], E[Y])\) at the opposite corner.

Shade the rectangle green in quadrants I and III (where \((x - E[X])(y - E[Y]) > 0\)), otherwise red, with intensity proportional to \(f(x,y)\).

Covariance (roughly) measures how much green vs red there is.

Geometric representation (3)

Geometric representation (4)

Geometric representation (5)

Alternative formulation

First formulation:

\[\text{Cov}[X, Y] = E\left[ (X - E[X])(Y - E[Y]) \right]\]

As with variance, an alternative formulation:

\[\text{Cov}[X, Y] = E\left[XY\right] - E[X]E[Y]\]

Note:

  • if \(E[X] = E[Y] = 0\) (e.g. if recentered), both become \(E[XY]\)
  • geometrically, can think in terms of areas of rectangles
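A quick Python check (a sketch; names are mine) that the two formulations agree on the example PMF:

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

E_X = sum(x * p for (x, y), p in f.items())
E_Y = sum(y * p for (x, y), p in f.items())
E_XY = sum(x * y * p for (x, y), p in f.items())  # only (1,1) contributes: 1/2

# First formulation: E[(X - E[X])(Y - E[Y])]
cov_def = sum((x - E_X) * (y - E_Y) * p for (x, y), p in f.items())
# Alternative formulation: E[XY] - E[X]E[Y]
cov_alt = E_XY - E_X * E_Y  # 1/2 - (2/3)(1/2) = 1/6
```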

Geometry of \(E[XY] - E[X]E[Y]\)

Geometry of \(E[XY] - E[X]E[Y]\) (2)

Linearity of expectations, but not variances

If \(f\) is a linear operator, then \(f(x + y) = f(x) + f(y)\). (Additivity property.)


Recall linearity of expectations: \(E[X + Y] = E[X] + E[Y]\).


But in general \(\text{Var}[X + Y] \neq \text{Var}[X] + \text{Var}[Y]\)

Why not?

Variance rule (non-linearity of variance)

A different proof from A&R 2.2.3

\[\begin{aligned} \text{Var}(X+Y) &= E[(X + Y - E[X + Y])^2] \\ &= E[(X - E[X] + Y - E[Y])^2] \\ &= E[(\tilde{X} + \tilde{Y})^2] \\ &= E[\tilde{X}^2 + \tilde{Y}^2 + 2 \tilde{X} \tilde{Y}] \\ &= E[\tilde{X}^2] + E[\tilde{Y}^2] + E[2 \tilde{X} \tilde{Y}] \\ &= E[(X - E[X])^2] + E[(Y - E[Y])^2] + 2E[(X - E[X])(Y - E[Y])] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \end{aligned}\]
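The identity can be verified numerically on the example PMF (Python sketch; the helper `E` is my own):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def E(g):
    # Expectation of g(x, y) under the joint PMF
    return sum(g(x, y) * p for (x, y), p in f.items())

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)
var_X = E(lambda x, y: (x - E_X) ** 2)                # 2/9
var_Y = E(lambda x, y: (y - E_Y) ** 2)                # 1/4
cov = E(lambda x, y: (x - E_X) * (y - E_Y))           # 1/6

# Variance of the sum, computed directly
E_S = E(lambda x, y: x + y)
var_S = E(lambda x, y: (x + y - E_S) ** 2)
# var_S should equal var_X + var_Y + 2*cov
```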

Correlation

The correlation of two RVs \(X\) and \(Y\) with \(\sigma[X] > 0\) and \(\sigma[Y] > 0\) is

\[ \rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]}\]

Correlation is scale-invariant: \(\rho[X, Y] = \rho[aX, bY]\) for \(a, b > 0\)

Prove it!

Proof of scale-invariance of correlation

\[\begin{align} \text{Cov}[aX, bY] &= E[aX bY] - E[aX]E[bY] \\ &= ab E[XY] - ab E[X]E[Y] \\ &= ab (E[XY] - E[X]E[Y]) \\ &= ab \text{Cov}[X, Y] \end{align}\]

\[\sigma[aX] = \sqrt{\text{V}[aX]} = \sqrt{a^2 \text{V}[X]} = a \sigma[X] \quad (\text{since } a > 0)\]

By same argument, \(\sigma[bY] = b\sigma[Y]\).

So

\[\begin{align} \rho[aX, bY] &= \frac{\text{Cov}[aX, bY]}{\sigma[aX] \sigma[bY]} \\ &= \frac{ab \text{Cov}[X, Y]}{a \sigma[X] b \sigma[Y]} = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]} \\ &= \rho[X, Y] \end{align}\]
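A numerical check of scale-invariance on the example PMF (Python sketch; `corr` is my own helper):

```python
import math

f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def corr(a, b):
    # Correlation of (a*X, b*Y) under the joint PMF
    EX = sum(a * x * p for (x, y), p in f.items())
    EY = sum(b * y * p for (x, y), p in f.items())
    cov = sum((a * x - EX) * (b * y - EY) * p for (x, y), p in f.items())
    var_X = sum((a * x - EX) ** 2 * p for (x, y), p in f.items())
    var_Y = sum((b * y - EY) ** 2 * p for (x, y), p in f.items())
    return cov / math.sqrt(var_X * var_Y)

# Rescaling by positive constants leaves the correlation unchanged
rho = corr(1, 1)
rho_scaled = corr(2, 5)
```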

Conditional expectations

We spent time on expectations:

\[E[Y] = \sum_y y f(y).\]

Also on conditional distributions:

\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}\]

Combining the two ideas, we get conditional expectations:

\[E[Y \mid X = x] = \sum_y y f_{Y|X}(y \mid x).\]

i.e. the expectation of \(Y\) at some \(x\).
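On the example PMF, this can be computed directly (Python sketch; `E_Y_given_X` is my own helper):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def E_Y_given_X(x0):
    # E[Y | X = x0] = sum_y y * f(x0, y) / f_X(x0)
    f_x = sum(p for (x, y), p in f.items() if x == x0)
    return sum(y * p for (x, y), p in f.items() if x == x0) / f_x

# E[Y | X = 0] = 0;  E[Y | X = 1] = (1/2) / (2/3) = 3/4
```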

Illustration

(Red line represents \(E[Y | X = x]\), dots a sample from \(f(x, y)\))

Illustration (2)

(Red line represents \(E[Y | X = x]\), dots a sample from \(f(x, y)\))

Conditional variance

Two formulations:

\[V[Y | X = x] = E[(Y - E[Y | X = x])^2 | X = x]\]

\[V[Y | X = x] = E[Y^2 | X = x] - E[Y | X = x]^2\]

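Both formulations can be checked on the example PMF (Python sketch; helper names are mine):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def cond_moment(x0, g):
    # E[g(Y) | X = x0] under the joint PMF
    f_x = sum(p for (x, y), p in f.items() if x == x0)
    return sum(g(y) * p for (x, y), p in f.items() if x == x0) / f_x

mu = cond_moment(1, lambda y: y)                  # E[Y | X = 1] = 3/4
v1 = cond_moment(1, lambda y: (y - mu) ** 2)      # first formulation
v2 = cond_moment(1, lambda y: y ** 2) - mu ** 2   # second formulation
# both equal 3/16
```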

Conditional expectations vs Conditional expectation function (CEF)

Conditional expectation \(E[Y | X = x]\) is for a specific \(x\).

Conditional expectation function (CEF) \(E[Y | X]\) is for all \(x\).

CEF as best predictor

The CEF \(E[Y | X]\) is the expectation of \(Y\) at each \(X\).

We already established that the expectation/mean is the best (in MSE sense) predictor.

So CEF is the best possible way to use \(X\) to predict \(Y\). (See Theorem 2.2.20.)

Multivariate generalization: \(E[Y \mid X_1, X_2, X_3, \ldots, X_n]\) is the best way to use \(X_1, \ldots, X_n\) to predict \(Y\).

Law of iterated expectations

For random variables \(X\) and \(Y\),

\[E[Y] = E[E[Y | X]]\]

This means there are two ways to get \(E[Y]\):

  • start with \(f(y)\), take expectations: \(E[Y] = \sum_y y f(y)\)
  • start with \(E[Y \mid X]\) and \(f_X(x)\), take expectations: \(E[Y] = \sum_x E[Y \mid X=x] f_X(x)\)

In words: An unconditional average (\(E[Y]\)) can be represented as a weighted average of conditional expectations (\(E[Y \mid X]\)) with weights taken from the distribution of the variable conditioned on, i.e. \(X\).
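A check of the two routes on the example PMF (Python sketch; names are mine):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

# Route 1: E[Y] directly from the joint (equivalently, the marginal of Y)
E_Y = sum(y * p for (x, y), p in f.items())  # 1/2

# Route 2 (LIE): weight E[Y | X = x] by f_X(x)
xs = {x for (x, y) in f}
E_Y_lie = 0.0
for x0 in xs:
    f_x = sum(p for (x, y), p in f.items() if x == x0)
    cond = sum(y * p for (x, y), p in f.items() if x == x0) / f_x
    E_Y_lie += cond * f_x
```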

Why would you want to do that?

LIE: An intuitive example

A population is 80% female and 20% male.

The average age among females (\(E[Y | X = 1]\)) is 25. The average age among males \(E[Y | X = 0]\) is 20.

What is the average age in the population \(E[Y]\)?

\[E[E[Y | X]] = .8 \times 25 + .2 \times 20 = 24\]

See homework for another example.

LIE: another example

LIE: another example (2)

How LIE is used in causal inference (preview)

Suppose we want to measure the average effect of participating in a program (e.g. job training, voter education, military mobilization).

Call \(Y\) the (unobservable) effect of the treatment. We want the average treatment effect (ATE), \(E[Y]\).

Suppose that comparing participants and non-participants gives us a good estimate of the average treatment effect only within subgroups defined by age (\(X\)).

So we have \(E[Y \mid X]\).

Now we just combine these estimates (by LIE): \(E[Y] = E[E[Y \mid X]] = \sum_{x} E[Y \mid X = x] f(x)\)

Law of total variance

\[V[Y] = E[V[Y|X]] + V[E[Y|X]]\]

In words, the variance of \(Y\) can be decomposed into the expected conditional variance (\(E[V[Y|X]]\)) and the variance of the conditional expectation (\(V[E[Y|X]]\)).

Sometimes called “Ev(v)e’s law” because

\[V[Y] = \color{red}{E}[\color{red}{V}[Y|X]] + \color{red}{V}[\color{red}{E}[Y|X]]\]
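Verifying the decomposition on the example PMF (Python sketch; names are mine):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}
xs = {x for (x, y) in f}

E_Y = sum(y * p for (x, y), p in f.items())
var_Y = sum((y - E_Y) ** 2 * p for (x, y), p in f.items())  # 1/4

E_cond_var = 0.0   # E[ V[Y|X] ]
var_cond_E = 0.0   # V[ E[Y|X] ]
for x0 in xs:
    f_x = sum(p for (x, y), p in f.items() if x == x0)
    mu = sum(y * p for (x, y), p in f.items() if x == x0) / f_x
    v = sum((y - mu) ** 2 * p for (x, y), p in f.items() if x == x0) / f_x
    E_cond_var += v * f_x
    var_cond_E += (mu - E_Y) ** 2 * f_x
# var_Y should equal E_cond_var + var_cond_E
```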

Law of total variance (2)

Best linear predictor (BLP)

Suppose we want to predict \(Y\) using \(X\), and we focus on a linear predictor, i.e. a function of the form \(\alpha + \beta X\).

The best (minimum MSE) predictor satisfies

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(Y - (a + bX)\right)^2]\]

The solution (see Theorem 2.2.21) is

  • \(\beta = \frac{\textrm{Cov}[X, Y]}{\textrm{V}[X]}\)
  • \(\alpha = \textrm{E}[Y] - \beta \textrm{E}[X]\)

So we could obtain the BLP from a joint PMF. (See homework.)
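For the example PMF, the BLP coefficients work out as follows (Python sketch; names are mine):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

E_X = sum(x * p for (x, y), p in f.items())
E_Y = sum(y * p for (x, y), p in f.items())
cov = sum((x - E_X) * (y - E_Y) * p for (x, y), p in f.items())
var_X = sum((x - E_X) ** 2 * p for (x, y), p in f.items())

beta = cov / var_X          # (1/6) / (2/9) = 3/4
alpha = E_Y - beta * E_X    # 1/2 - (3/4)(2/3) = 0
```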

BLP predicts CEF

Above, we were looking for best linear predictor (BLP) of \(Y\) as function of \(X\):

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(Y - (a + bX)\right)^2]\]

Same answer if you look for the best linear predictor of the CEF \(E[Y | X]\):

\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(\mathrm{E}[Y|X] - (a + bX)\right)^2]\]
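A numerical check on the example PMF that both objectives give the same slope (Python sketch; names are mine, and the comment about Cov[X, E[Y|X]] follows from LIE):

```python
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}
xs = {x for (x, y) in f}

# CEF m(x) = E[Y | X = x] and marginal f_X
f_X = {x0: sum(p for (x, y), p in f.items() if x == x0) for x0 in xs}
m = {x0: sum(y * p for (x, y), p in f.items() if x == x0) / f_X[x0]
     for x0 in xs}

# Slope of the BLP of Y on X
E_X = sum(x * p for (x, y), p in f.items())
E_Y = sum(y * p for (x, y), p in f.items())
var_X = sum((x - E_X) ** 2 * p for (x, y), p in f.items())
beta_Y = sum((x - E_X) * (y - E_Y) * p for (x, y), p in f.items()) / var_X

# Slope of the BLP of m(X) on X: replace Y with m(X) (E[m(X)] = E[Y] by LIE);
# Cov[X, m(X)] = Cov[X, Y], so the slopes coincide
beta_m = sum((x0 - E_X) * (m[x0] - E_Y) * f_X[x0] for x0 in xs) / var_X
```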